install.packages("tidyverse")
If the install fails, then run
install.packages(c("dplyr","tidyr","ggplot2"))
The installation only needs to be done once. But we will need to load the packages in every R session where we want to use them. To load the packages, use
library("dplyr")
library("tidyr")
library("ggplot2")
alternatively, you can load the entire (not very big)
tidyverse.
library("tidyverse")
## ── Attaching core tidyverse packages ──────────────────────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4.9000 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Change theme to a better one
theme_set(theme_bw()) # comment this line to see what default plots look like
The main purpose of the lab today is to construct plots using the
ggplot2 R package. In order to construct these plots, we
need to construct an appropriate data.frame and we will use
dplyr to help us construct that
data.frame.
Let’s use the built-in R data set airquality. Before we
start plotting let’s take a quick look at the data.
dim(airquality)
## [1] 153 6
head(airquality)
## Ozone Solar.R Wind Temp Month Day
## 1 41 190 7.4 67 5 1
## 2 36 118 8.0 72 5 2
## 3 12 149 12.6 74 5 3
## 4 18 313 11.5 62 5 4
## 5 NA NA 14.3 56 5 5
## 6 28 NA 14.9 66 5 6
tail(airquality)
## Ozone Solar.R Wind Temp Month Day
## 148 14 20 16.6 63 9 25
## 149 30 193 6.9 70 9 26
## 150 NA 145 13.2 77 9 27
## 151 14 191 14.3 75 9 28
## 152 18 131 8.0 76 9 29
## 153 20 223 11.5 68 9 30
summary(airquality)
## Ozone Solar.R Wind Temp
## Min. : 1.00 Min. : 7.0 Min. : 1.700 Min. :56.00
## 1st Qu.: 18.00 1st Qu.:115.8 1st Qu.: 7.400 1st Qu.:72.00
## Median : 31.50 Median :205.0 Median : 9.700 Median :79.00
## Mean : 42.13 Mean :185.9 Mean : 9.958 Mean :77.88
## 3rd Qu.: 63.25 3rd Qu.:258.8 3rd Qu.:11.500 3rd Qu.:85.00
## Max. :168.00 Max. :334.0 Max. :20.700 Max. :97.00
## NA's :37 NA's :7
## Month Day
## Min. :5.000 Min. : 1.0
## 1st Qu.:6.000 1st Qu.: 8.0
## Median :7.000 Median :16.0
## Mean :6.993 Mean :15.8
## 3rd Qu.:8.000 3rd Qu.:23.0
## Max. :9.000 Max. :31.0
##
For built in datasets, we can get more information by going to the help file.
?airquality
One issue with this dataset is that the Month/Day columns don’t
really provide us with a Date. Let’s create a new column
that creates a real Date.
airquality <- airquality %>%
dplyr::mutate(Date = as.Date(paste("1973",Month,Day,sep="/")))
If you deal with dates a lot, you should check out the lubridate package.
All ggplot2 graphics require a data.frame
containing the data and this data.frame is always the first
argument to a ggplot call. After this, we specify some
aesthetics using the aes() function. Then, we tell
ggplot2 what kind of graphics to construct.
ggplot(airquality, # data.frame containing the data
aes(x=Ozone)) + # a column name from the data.frame
geom_histogram() # create a histogram
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 37 rows containing non-finite outside the scale range
## (`stat_bin()`).
If you want to avoid the message, you can specify the number of bins to use.
ggplot(airquality, aes(x=Ozone)) +
geom_histogram(bins = 40)
## Warning: Removed 37 rows containing non-finite outside the scale range
## (`stat_bin()`).
If you want plot on the density scale (so that you can compare to a pdf), use the following:
ggplot(airquality, aes(x=Ozone)) +
geom_histogram(aes(y=..density..), bins = 40)
## Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(density)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
## Warning: Removed 37 rows containing non-finite outside the scale range
## (`stat_bin()`).
Create a histogram of solar radiation on the density scale with 50 bins.
ggplot(airquality, aes(x=Solar.R)) +
geom_histogram(aes(y=..density..), bins = 50)
## Warning: Removed 7 rows containing non-finite outside the scale range
## (`stat_bin()`).
The syntax for boxplots is similar except that the variable you are interest in is the y aesthetic.
ggplot(airquality,
aes(x=1,y=Ozone)) +
geom_boxplot()
## Warning: Removed 37 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
Comparing boxplots
ggplot(airquality,
aes(x=Month, y=Ozone, group=Month)) +
geom_boxplot()
## Warning: Removed 37 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
Create boxplots of wind speed by month. Bonus: See if you can google to find out how to swap the axes, i.e. have Month on the y-axis and Wind on the x-axis.
ggplot(airquality,
aes(x=Month, y=Wind, group=Month)) +
geom_boxplot(outlier.shape = NA, color='grey') +
coord_flip()
At this point we can construct individual graphs for our 4 different response variables: Ozone, Solar.R, Wind, and Temp. Perhaps we want to understand the temporal variability for Ozone. We can use a scatterplot of Ozone vs Date.
ggplot(airquality, aes(x = Date, y = Ozone)) +
geom_point()
## Warning: Removed 37 rows containing missing values or values outside the scale
## range (`geom_point()`).
or if we wanted a line plot
ggplot(airquality, aes(x = Date, y = Ozone)) +
geom_line()
Notice that the line is disconnected wherever we have missing data.
Perhaps we want to understand the relationship between solar radiation and ozone.
ggplot(airquality, aes(x = Solar.R, y = Ozone)) +
geom_point()
## Warning: Removed 42 rows containing missing values or values outside the scale
## range (`geom_point()`).
Create a scatterplot of wind speed versus temperature.
ggplot(airquality, aes(x = Temp, y = Wind)) +
geom_point()
Scatterplots don’t look so good when there are data points that overlap. For example, when plotting Ozone vs Month the points may overlap due to Month only having 5 values in the data set.
ggplot(airquality,
aes(x=Month, y=Ozone, group=Month)) +
geom_point()
## Warning: Removed 37 rows containing missing values or values outside the scale
## range (`geom_point()`).
So, instead we will typically jitter the points a bit to remove the overlap, e.g.
ggplot(airquality,
aes(x=Month, y=Ozone, group=Month)) +
geom_jitter()
## Warning: Removed 37 rows containing missing values or values outside the scale
## range (`geom_point()`).
Now, we can combine the boxplots we discussed earlier with scatterplots or jittered scatterplots, e.g.
ggplot(airquality,
aes(x=Month, y=Ozone, group=Month)) +
geom_boxplot(color='grey', # make the boxes not so obvious
outlier.shape = NA) + # remove outliers,
geom_point() # because they get plotted here
## Warning: Removed 37 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
## Warning: Removed 37 rows containing missing values or values outside the scale
## range (`geom_point()`).
or
ggplot(airquality,
aes(x=Month, y=Ozone, group=Month)) +
geom_boxplot(color='grey', # make the boxes not so obvious
outlier.shape = NA) + # remove outliers,
geom_jitter() # because they get plotted here
## Warning: Removed 37 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
## Warning: Removed 37 rows containing missing values or values outside the scale
## range (`geom_point()`).
Create a scatterplot of wind speed by month and add a boxplot for each month in the background.
ggplot(airquality,
aes(x=Month, y=Wind, group=Month)) +
geom_boxplot(outlier.shape = NA, color='grey') +
geom_jitter() +
coord_flip()
If we want to put all the response variables on the same plot, we can
color them. In order to do this, we will need to organize our
data.frame into long format.
airquality_long <- airquality %>%
dplyr::select(-Month, -Day) %>% # Remove these columns
tidyr::pivot_longer(-Date,
names_to = "response",
values_to = "value")
Take a look at the resulting data.frame.
dim(airquality)
## [1] 153 7
dim(airquality_long)
## [1] 612 3
head(airquality_long)
## # A tibble: 6 × 3
## Date response value
## <date> <chr> <dbl>
## 1 1973-05-01 Ozone 41
## 2 1973-05-01 Solar.R 190
## 3 1973-05-01 Wind 7.4
## 4 1973-05-01 Temp 67
## 5 1973-05-02 Ozone 36
## 6 1973-05-02 Solar.R 118
summary(airquality_long)
## Date response value
## Min. :1973-05-01 Length:612 Min. : 1.00
## 1st Qu.:1973-06-08 Class :character 1st Qu.: 13.00
## Median :1973-07-16 Mode :character Median : 66.00
## Mean :1973-07-16 Mean : 80.06
## 3rd Qu.:1973-08-23 3rd Qu.: 91.00
## Max. :1973-09-30 Max. :334.00
## NA's :44
table(airquality_long$response)
##
## Ozone Solar.R Temp Wind
## 153 153 153 153
ggplot(airquality_long,
aes(x = Date, y = value,
linetype = response,
color = response,
group = response)) +
geom_line()
Notice that the legend is automatically created. This is not something that is done in base R graphics.
Honestly, this doesn’t look very good, so it is better to facet the plot.
Facets are often a better way of representing multiple variables.
ggplot(airquality_long, aes(Date, value)) +
geom_point() +
facet_wrap(~response)
## Warning: Removed 44 rows containing missing values or values outside the scale
## range (`geom_point()`).
Since the axes are quite different for the different responses, we can allow them to vary in the different facets.
ggplot(airquality_long, aes(Date, value)) +
geom_line() +
facet_wrap(~response,scales="free_y")
Alternatively, we can use facet_grid which is more
useful when you have more variables you want to facet by.
ggplot(airquality_long, aes(Date, value)) +
geom_line() +
facet_wrap(.~response,scales="free_y")
or
ggplot(airquality_long, aes(Date, value)) +
geom_line() +
facet_wrap(response ~ .,scales="free_y")
If we only had the long version of the data.frame, we
can reconstruct the wide version by using the following
airquality2 <- airquality_long %>%
tidyr::pivot_wider(
names_from = response,
values_from = value)
Sometimes it is helpful to save the plot as an R object so that it can be updated in the future. To save the plot, just use the assignment operator, e.g.
g <- ggplot(airquality2,
aes(x = Temp, y = Wind)) +
geom_point()
g # Then you can see the plot by just typing the object name
We would like this plot to be a bit more informative, so we will add some informative labels.
g <- g +
labs(x = "Temperature (F)",
y = "Wind speed (mph)",
title = "New York (May-September 1973)")
g
As you have seen before, we can also change the theme. I prefer the simple “bw” theme, but here is the default theme.
g <- g + theme_classic()
g
We can add a regression line.
g <- g + geom_smooth(method="lm")
g
## `geom_smooth()` using formula = 'y ~ x'
Alternatively, you can combine all the steps
ggplot(airquality2,
aes(x = Temp, y = Wind)) +
geom_point() +
geom_smooth(method = "lm") +
labs(x = "Temperature (F)",
y = "Wind speed (mph)",
title = "New York (May-September 1973)")
## `geom_smooth()` using formula = 'y ~ x'
Use the cars dataset to construct and customize a figure
displaying the relationship between the stopping distance and speed of a
car.
ggplot(cars,
aes(x=speed, y=dist)) +
geom_point() +
geom_smooth(method = "lm") +
labs(x = "Speed (mph)",
y = "Stopping distance (ft)",
title = "Stopping distance as a function of speed (1920s)")
## `geom_smooth()` using formula = 'y ~ x'
If you want to save the plot, use the ggsave function, e.g.
ggsave(filename = "plot.png",
plot = g,
width = 5,
height = 4)